Automatic differentiation

Automatic differentiation is a method for evaluating the rate of change of the numerical output of a program with respect to its input. The power of the method lies in the ability to write a program that computes a differentiable function and have its derivative immediately available.

We start with an example.

Consider the function $$ f(x) = \cos(x)\sin(x) $$

When we want to evaluate the function numerically at a specific $x$, say $x=1$, we can implement a computer program like

x = 1
f = cos(x)*sin(x)

or

def g(x):
    return cos(x)*sin(x)

x = 1
f = g(x)

Now suppose we need the derivative as well, that is, how much $f$ changes when we slightly change $x$. For this example, it is a simple exercise to calculate the derivative symbolically as $$ f'(x) = \cos(x)\cos(x) -\sin(x)\sin(x) = \cos(x)^2 - \sin(x)^2 $$ and code this explicitly as

x = 1
df_dx =  cos(x)*cos(x) - sin(x)*sin(x)

But could we have calculated the derivative without coding it up explicitly, that is, without symbolically evaluating it by hand a priori? For example, can we code just

x = my.Variable(1)
f = my.cos(x)*my.sin(x)
df_dx = f.derivative()

or

def g(x):
    return my.cos(x)*my.sin(x)

x = my.Variable(1)
f = g(x)
df_dx = f.derivative()

to get what we want, perhaps by overloading the appropriate variables, functions and operators? The answer turns out to be yes, and it is a quite fascinating subject called automatic differentiation. Interestingly, this algorithm, also known as backpropagation, is at the core of today's artificial intelligence systems, programs that learn how to program themselves from input and output examples. See https://www.youtube.com/watch?v=aircAruvnKk for an introduction to a particular type of model, known as a neural network.
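
To preview the answer, here is a minimal sketch of how such an interface could be built with operator overloading. The names Variable, cos, sin and derivative mirror the hypothetical my module above; a real automatic differentiation library is considerably more general, but the mechanism is the one developed in the rest of this note.

import math

class Variable:
    def __init__(self, value, parents=()):
        self.value = value        # computed in the forward pass
        self.parents = parents    # (parent variable, local derivative) pairs
        self.grad = 0.0           # will hold df/d(this variable)

    def __mul__(self, other):
        # local derivatives of a*b: d/da = b, d/db = a
        return Variable(self.value * other.value,
                        [(self, other.value), (other, self.value)])

    def derivative(self):
        # collect the nodes in topological order (inputs first, output last)
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)
        visit(self)
        # backward pass in reverse topological order
        self.grad = 1.0           # boundary case: df/df = 1
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += local * node.grad
        return self

def cos(x):
    return Variable(math.cos(x.value), [(x, -math.sin(x.value))])

def sin(x):
    return Variable(math.sin(x.value), [(x, math.cos(x.value))])

x = Variable(1.0)
f = cos(x) * sin(x)
f.derivative()
print(x.grad)                                   # automatic derivative
print(math.cos(1.0)**2 - math.sin(1.0)**2)      # hand-derived derivative

The rest of this section unpacks why this backward accumulation computes exactly the derivative we want.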

To symbolically evaluate the derivative, we use the chain rule. The chain rule dictates that when $$ f(x) = g(h(x)) $$ the derivative is given as $$ f'(x) = g'(h(x)) h'(x) $$

We could implement this program as

x = 1
h = H(x)
g = G(h)
f = g

where we have used capital letters for the functions -- beware that in the usual mathematical notation the function and its output are often denoted by the same letter. To highlight the underlying mechanism of automatic differentiation, we will always assign the output of a function to a variable, so we only ever think of the rate of change of one variable with respect to another, rather than of 'derivatives of functions'. To be entirely formal we write

x = 1
h = H(x)
g = G(h)
f = identity(g)

and denote the identity function as $(\cdot)$. This program can also be represented by the following directed computation graph:

The derivative is denoted by $$ f'(x) = \frac{df}{dx} $$ As we will later use multiple variables, we already introduce the partial derivative notation, which for a scalar function of a single variable is equivalent to the ordinary derivative: $$ f'(x) = \frac{\partial f}{\partial x} $$

The chain rule, using the partial derivative notation can be stated as \begin{eqnarray} \frac{\partial f}{\partial x} & = & \frac{\partial h}{\partial x} \frac{\partial g}{\partial h} \frac{\partial f}{\partial g} \\ & = & h'(x) g'(h(x)) \cdot 1 \end{eqnarray}

This quantity is actually just a product of numbers, so we could have evaluated this derivative in the following order \begin{eqnarray} \frac{\partial f}{\partial x} & = & \frac{\partial h}{\partial x} \left(\frac{\partial g}{\partial h} \left(\frac{\partial f}{\partial g} \frac{\partial f}{\partial f} \right) \right) \\ & = & \frac{\partial h}{\partial x} \left(\frac{\partial g}{\partial h} \frac{\partial f}{\partial g} \right) \\ & = & \frac{\partial h}{\partial x} \frac{\partial f}{\partial h} \\ & = & \frac{\partial f}{\partial x} \end{eqnarray} where we have included $\partial f/\partial f = 1$ as the boundary case.

So, we could imagine calculating the derivative using the following program

df_df = 1              # boundary case df/df = 1
df_dg = 1 * df_df      # f = identity(g), so df/dg = 1
df_dh = dG(h) * df_dg  # dG is the derivative of G, evaluated at h
df_dx = dH(x) * df_dh  # dH is the derivative of H, evaluated at x

If $g$ and $h$ are elementary functions, their derivatives are known in closed form and can be calculated from their input(s) only.
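
For instance, such closed-form local derivatives could be tabulated as follows; the dictionary layout is only an illustrative sketch, not any particular library's API.

import math

# Closed-form derivatives of a few elementary functions; each local
# derivative is computed from the function's input alone.
ELEMENTARY = {
    'sin': (math.sin, lambda x:  math.cos(x)),
    'cos': (math.cos, lambda x: -math.sin(x)),
    'exp': (math.exp, lambda x:  math.exp(x)),
    'log': (math.log, lambda x:  1.0 / x),
}

fun, dfun = ELEMENTARY['cos']
h, dh_dx = fun(1.0), dfun(1.0)   # forward value and local derivative at x = 1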

As an example, consider $$ f(x) = \sin(\cos(x)) $$

The derivative is $$ \frac{\partial f}{\partial x} = -\sin(x) \cos(\cos(x)) $$

df_df =  1                # df/df = 1
df_dg =  1 * df_df        # f = identity(g)
df_dh =  cos(h) * df_dg   # dg/dh = cos(h), since g = sin(h)
df_dx = -sin(x) * df_dh   # dh/dx = -sin(x), since h = cos(x)

As $h=\cos(x)$, it can be easily verified that the derivative is calculated correctly.
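
A runnable version of this check, assuming only the standard math module, could look like this:

import math

x = 1.0
# forward pass
h = math.cos(x)
g = math.sin(h)
f = g

# backward pass
df_df = 1.0
df_dg = 1.0 * df_df
df_dh = math.cos(h) * df_dg    # derivative of sin, evaluated at h
df_dx = -math.sin(x) * df_dh   # derivative of cos, evaluated at x

print(df_dx, -math.sin(x) * math.cos(math.cos(x)))  # identical values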

Functions of two or more variables

When we have functions of two or more variables the notion of a derivative changes slightly. For example, when $$ g(x_1, x_2) $$ we define the partial derivatives \begin{eqnarray} \frac{\partial g}{\partial x_1} & , & \frac{\partial g}{\partial x_2} \end{eqnarray}

The collection of partial derivatives can be organized as a vector. This object is known as the gradient and is denoted as \begin{eqnarray} \nabla g(x) \equiv \left(\begin{array}{c} \frac{\partial g}{\partial x_1} \\ \frac{\partial g}{\partial x_2} \end{array} \right) \end{eqnarray}

When taking the partial derivative, we assume that all the variables are constant, apart from the one that we are taking the derivative with respect to.

For example, $$ g(x_1, x_2) = \cos(x_1)e^{3 x_2} $$ When taking the (partial) derivative with respect to $x_1$, we assume that the second factor is a constant $$ \frac{\partial g}{\partial x_1} = -\sin(x_1) e^{3 x_2} $$ Similarly, when taking the partial derivative with respect to $x_2$, we assume that the first factor is a constant $$ \frac{\partial g}{\partial x_2} = 3 \cos(x_1) e^{3 x_2} $$
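
These partial derivatives can be checked against central finite differences; a small sketch, with an arbitrarily chosen test point and step size:

import math

def g(x1, x2):
    return math.cos(x1) * math.exp(3 * x2)

x1, x2, eps = 0.5, 0.2, 1e-6     # arbitrary test point and step size

# central finite-difference approximations
dg_dx1_num = (g(x1 + eps, x2) - g(x1 - eps, x2)) / (2 * eps)
dg_dx2_num = (g(x1, x2 + eps) - g(x1, x2 - eps)) / (2 * eps)

# the closed-form partial derivatives derived above
dg_dx1 = -math.sin(x1) * math.exp(3 * x2)
dg_dx2 = 3 * math.cos(x1) * math.exp(3 * x2)

print(dg_dx1_num, dg_dx1)        # approximately equal
print(dg_dx2_num, dg_dx2)        # approximately equal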

The chain rule for multiple variables is similar to the chain rule for single-variable functions, but with a caveat: the derivatives along all paths between the two variables need to be added.

An important special case is $$ f(x) = g(h_1(x), h_2(x)) $$ Here, the derivative of $g$ with respect to $x$ collects contributions from both arguments $$ \frac{\partial g}{\partial x} = \frac{\partial g}{\partial h_1} \frac{\partial h_1}{\partial x} + \frac{\partial g}{\partial h_2} \frac{\partial h_2}{\partial x} $$ and the chain rule keeps its simple form $$ \frac{\partial f}{\partial x} = \frac{\partial f}{\partial g} \frac{\partial g}{\partial x} $$

To see a concrete example of a function of form $f(x) = g(h_1(x), h_2(x)) $, consider $$ f(x) = \sin(x)\cos(x) $$

We define \begin{align} h_1(x) & = c = \cos(x) \\ h_2(x) & = s = \sin(x) \\ g(c,s) & = c \times s \\ f & = g(c,s) \end{align}

which is equivalent to the following program, written deliberately as a sequence of scalar function evaluations and binary operators only

x = 1
c = cos(x)
s = sin(x)
g = c * s
f = g

This program can be represented by the following directed computation graph:

The function can be evaluated by traversing the variable nodes of the directed graph from the inputs to the outputs in topological order. At each variable node, we merely evaluate the incoming function; the topological order guarantees that the inputs of that function have already been calculated.
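
One possible way to make this concrete is to store the graph as a list of assignments that is already in topological order; the layout below is only an illustrative sketch.

import math

# The computation graph as a list of (variable, function) pairs, written in
# topological order; each function reads previously computed values.
graph = [
    ('c', lambda v: math.cos(v['x'])),   # c = cos(x)
    ('s', lambda v: math.sin(v['x'])),   # s = sin(x)
    ('g', lambda v: v['c'] * v['s']),    # g = c * s
    ('f', lambda v: v['g']),             # f = identity(g)
]

values = {'x': 1.0}                      # the input node
for name, fun in graph:                  # forward pass in topological order
    values[name] = fun(values)

print(values['f'], math.cos(1.0) * math.sin(1.0))  # identical values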

It is not obvious, but the derivatives can also be calculated easily. By the chain rule, we have \begin{eqnarray} \frac{\partial f}{\partial x} &=& \frac{\partial f}{\partial g} \frac{\partial g}{\partial c} \frac{\partial c}{\partial x} + \frac{\partial f}{\partial g} \frac{\partial g}{\partial s} \frac{\partial s}{\partial x} \\ &=& 1 \cdot s \cdot (-\sin(x)) + 1 \cdot c \cdot \cos(x) \\ &=& -\sin(x) \cdot \sin(x) + \cos(x) \cdot \cos(x) \end{eqnarray}

The derivative could have been calculated numerically by the following program

df_dx = df_ds = df_dc = df_dg = 0
df_df = 1

df_dg +=  df_df           # df/dg = 1
df_dc +=  s * df_dg       # dg/dc = s
df_ds +=  c * df_dg       # dg/ds = c
df_dx +=  cos(x) * df_ds  # ds/dx = cos(x)
df_dx += -sin(x) * df_dc  # dc/dx = -sin(x)
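
A runnable version of the forward and backward passes together, assuming the standard math module, confirms the result:

import math

x = 1.0
# forward pass (the program above)
c = math.cos(x)
s = math.sin(x)
g = c * s
f = g

# backward pass, accumulating contributions over both paths
df_dx = df_ds = df_dc = df_dg = 0.0
df_df = 1.0
df_dg += df_df
df_dc += s * df_dg
df_ds += c * df_dg
df_dx += math.cos(x) * df_ds
df_dx += -math.sin(x) * df_dc

print(df_dx, math.cos(x)**2 - math.sin(x)**2)  # identical values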

Note that the total derivative is a sum of several terms. Each term is the product of the derivatives along one path leading from $f$ to $x$. In the above example, there are only two paths:

  • $f,g,c,x$
  • $f,g,s,x$
$$ \frac{\partial f}{\partial x} = \frac{\partial f}{\partial g} \frac{\partial g}{\partial c} \frac{\partial c}{\partial x} + \frac{\partial f}{\partial g} \frac{\partial g}{\partial s} \frac{\partial s}{\partial x} $$

It is not obvious in this simple example, but propagating backwards saves computation, because intermediate derivatives are computed once, stored, and reused by every path that shares them.

This program can be represented by the following directed computation graph:

Note that during the backward pass, if we traverse the variable nodes in reverse topological order, we only need derivatives that have already been computed in previous steps and the values of the variables connected to the corresponding function node, which were computed during the forward pass. As an example, consider

$$ \frac{\partial f}{\partial c} = \frac{\partial f}{\partial g} \frac{\partial g}{\partial c} $$

The first term is already available during the backward pass. The second term needs to be programmed by calculating the partial derivative of $g(s,c) = sc$ with respect to $c$. It has a simple form, namely $s$. More importantly, the numerical value is also immediately available, as it is calculated during the forward pass. For each function type, this calculation will be different but is nevertheless straightforward for all basic functions, including the binary arithmetic operators $+,-,\times$ and $\div$.
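
For the binary operators, these local derivative rules could be tabulated as in the following sketch; the dictionary is illustrative, not any particular library's API.

import math

# Local partial derivatives of the basic binary operators with respect to
# their operands a and b, evaluated from forward-pass values only.
BINARY_DERIVATIVES = {
    '+': (lambda a, b: 1.0,      lambda a, b: 1.0),
    '-': (lambda a, b: 1.0,      lambda a, b: -1.0),
    '*': (lambda a, b: b,        lambda a, b: a),
    '/': (lambda a, b: 1.0 / b,  lambda a, b: -a / b**2),
}

# For g = c * s at x = 1, the backward step multiplies the incoming df_dg by
# these local derivatives, exactly as in the program above.
c, s, df_dg = math.cos(1.0), math.sin(1.0), 1.0
d_dc, d_ds = BINARY_DERIVATIVES['*']
df_dc = d_dc(c, s) * df_dg   # equals s * df_dg
df_ds = d_ds(c, s) * df_dg   # equals c * df_dg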

Tutorial introductions to automatic differentiation

Richard D. Neidinger, Introduction to Automatic Differentiation and MATLAB Object-Oriented Programming, SIAM Review, Vol. 52, No. 3, pp. 545–563, 2010.

Baydin, Atılım Güneş, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. 2018. “Automatic Differentiation in Machine Learning: a Survey.” Journal of Machine Learning Research (JMLR) https://arxiv.org/pdf/1502.05767.pdf

Two related blog posts from Ben Recht:

http://www.argmin.net/2016/05/18/mates-of-costate/

http://www.argmin.net/2016/05/31/mechanics-of-lagrangians/

Back-propagation, an introduction, by Sanjeev Arora and Tengyu Ma http://www.offconvex.org/2016/12/20/backprop/

A nice automatic differentiation package for Python: https://github.com/HIPS/autograd

A good tutorial on Backpropagation by Roger Grosse

http://www.cs.toronto.edu/~rgrosse/courses/csc321_2017/slides/lec6.pdf

http://www.cs.toronto.edu/~rgrosse/courses/csc321_2017/readings/L06%20Backpropagation.pdf


In [1]:
from __future__ import absolute_import
import autograd.numpy as np
import matplotlib.pyplot as plt
from autograd import elementwise_grad as egrad

'''
Mathematically we can only take gradients of scalar-valued functions.
autograd's elementwise_grad handles a scalar function that has been
vectorized over its input in the familiar numpy way: the Jacobian of the
resulting vector-to-vector mapping is diagonal, and elementwise_grad
returns its diagonal, i.e. the derivative of the function at each input
value over which the function is vectorized.
'''

def tanh(x):
    # tanh written out via exponentials, so autograd differentiates a composition
    return (1.0 - np.exp(-2 * x)) / (1.0 + np.exp(-2 * x))

x = np.linspace(-7, 7, 200)
plt.plot(x, tanh(x),
         x, egrad(tanh)(x),                                    # first derivative
         x, egrad(egrad(tanh))(x),                             # second derivative
         x, egrad(egrad(egrad(tanh)))(x),                      # third derivative
         x, egrad(egrad(egrad(egrad(tanh))))(x),               # fourth derivative
         x, egrad(egrad(egrad(egrad(egrad(tanh)))))(x),        # fifth derivative
         x, egrad(egrad(egrad(egrad(egrad(egrad(tanh))))))(x)) # sixth derivative

plt.axis('off')
plt.savefig("tanh.png")
plt.show()